nigerian pidgin
Surfacing Subtle Stereotypes: A Multilingual, Debate-Oriented Evaluation of Modern LLMs
Saeed, Muhammed, Abdul-mageed, Muhammad, Shehata, Shady
Large language models (LLMs) are widely deployed for open-ended communication, yet most bias evaluations still rely on English, classification-style tasks. We introduce DebateBias-8K, a new multilingual, debate-style benchmark designed to reveal how narrative bias appears in realistic generative settings. Our dataset includes 8,400 structured debate prompts spanning four sensitive domains: women's rights, socioeconomic development, terrorism, and religion, across seven languages ranging from high-resource (English, Chinese) to low-resource (Swahili, Nigerian Pidgin). Using four flagship models (GPT-4o, Claude 3, DeepSeek, and LLaMA 3), we generate and automatically classify over 100,000 responses. Results show that all models reproduce entrenched stereotypes despite safety alignment: Arabs are overwhelmingly linked to terrorism and religion (>=95%), Africans to socioeconomic "backwardness" (up to <=77%), and Western groups are consistently framed as modern or progressive. Biases grow sharply in lower-resource languages, revealing that alignment trained primarily in English does not generalize globally. Our findings highlight a persistent divide in multilingual fairness: current alignment methods reduce explicit toxicity but fail to prevent biased outputs in open-ended contexts. We release our DebateBias-8K benchmark and analysis framework to support the next generation of multilingual bias evaluation and safer, culturally inclusive model alignment.
Exploring Cultural Nuances in Emotion Perception Across 15 African Languages
Ahmad, Ibrahim Said, Dudy, Shiran, Belay, Tadesse Destaw, Abdulmumin, Idris, Yimam, Seid Muhie, Muhammad, Shamsuddeen Hassan, Church, Kenneth
Understanding how emotions are expressed across languages is vital for building culturally-aware and inclusive NLP systems. However, emotion expression in African languages is understudied, limiting the development of effective emotion detection tools in these languages. In this work, we present a cross-linguistic analysis of emotion expression in 15 African languages. We examine four key dimensions of emotion representation: text length, sentiment polarity, emotion co-occurrence, and intensity variations. Our findings reveal diverse language-specific patterns in emotional expression -- with Somali texts typically longer, while others like IsiZulu and Algerian Arabic show more concise emotional expression. We observe a higher prevalence of negative sentiment in several Nigerian languages compared to lower negativity in languages like IsiXhosa. Further, emotion co-occurrence analysis demonstrates strong cross-linguistic associations between specific emotion pairs (anger-disgust, sadness-fear), suggesting universal psychological connections. Intensity distributions show multimodal patterns with significant variations between language families; Bantu languages display similar yet distinct profiles, while Afroasiatic languages and Nigerian Pidgin demonstrate wider intensity ranges. These findings highlight the need for language-specific approaches to emotion detection while identifying opportunities for transfer learning across related languages.
Modeling Orthographic Variation Improves NLP Performance for Nigerian Pidgin
Lin, Pin-Jie, Scholman, Merel, Saeed, Muhammed, Demberg, Vera
Nigerian Pidgin is an English-derived contact language and is traditionally an oral language, spoken by approximately 100 million people. No orthographic standard has yet been adopted, and thus the few available Pidgin datasets that exist are characterised by noise in the form of orthographic variations. This contributes to under-performance of models in critical NLP tasks. The current work is the first to describe various types of orthographic variations commonly found in Nigerian Pidgin texts, and model this orthographic variation. The variations identified in the dataset form the basis of a phonetic-theoretic framework for word editing, which is used to generate orthographic variations to augment training data. We test the effect of this data augmentation on two critical NLP tasks: machine translation and sentiment analysis. The proposed variation generation framework augments the training data with new orthographic variants which are relevant for the test set but did not occur in the training set originally. Our results demonstrate the positive effect of augmenting the training data with a combination of real texts from other corpora as well as synthesized orthographic variation, resulting in performance improvements of 2.1 points in sentiment analysis and 1.4 BLEU points in translation to English.
Text Categorization Can Enhance Domain-Agnostic Stopword Extraction
Turki, Houcemeddine, Etori, Naome A., Taieb, Mohamed Ali Hadj, Omotayo, Abdul-Hakeem, Emezue, Chris Chinenye, Aouicha, Mohamed Ben, Awokoya, Ayodele, Lawan, Falalu Ibrahim, Nixdorf, Doreen
This paper investigates the role of text categorization in streamlining stopword extraction in natural language processing (NLP), specifically focusing on nine African languages alongside French. By leveraging the MasakhaNEWS, African Stopwords Project, and MasakhaPOS datasets, our findings emphasize that text categorization effectively identifies domain-agnostic stopwords with over 80% detection success rate for most examined languages. Nevertheless, linguistic variances result in lower detection rates for certain languages. Interestingly, we find that while over 40% of stopwords are common across news categories, less than 15% are unique to a single category. Uncommon stopwords add depth to text but their classification as stopwords depends on context. Therefore combining statistical and linguistic approaches creates comprehensive stopword lists, highlighting the value of our hybrid method. This research enhances NLP for African languages and underscores the importance of text categorization in stopword extraction.
Low-Resource Cross-Lingual Adaptive Training for Nigerian Pidgin
Lin, Pin-Jie, Saeed, Muhammed, Chang, Ernie, Scholman, Merel
Developing effective spoken language processing systems for low-resource languages poses several challenges due to the lack of parallel data and limited resources for fine-tuning models. In this work, we target on improving upon both text classification and translation of Nigerian Pidgin (Naija) by collecting a large-scale parallel English-Pidgin corpus and further propose a framework of cross-lingual adaptive training that includes both continual and task adaptive training so as to adapt a base pre-trained model to low-resource languages. Our studies show that English pre-trained language models serve as a stronger prior than multilingual language models on English-Pidgin tasks with up to 2.38 BLEU improvements; and demonstrate that augmenting orthographic data and using task adaptive training with back-translation can have a significant impact on model performance.
How Good are Commercial Large Language Models on African Languages?
Recent advancements in Natural Language Processing (NLP) has led to the proliferation of large pretrained language models. These models have been shown to yield good performance, using in-context learning, even on unseen tasks and languages. They have also been exposed as commercial APIs as a form of language-model-as-a-service, with great adoption. However, their performance on African languages is largely unknown. We present a preliminary analysis of commercial large language models on two tasks (machine translation and text classification) across eight African languages, spanning different language families and geographical areas. Our results suggest that commercial language models produce below-par performance on African languages. We also find that they perform better on text classification than machine translation. In general, our findings present a call-to-action to ensure African languages are well represented in commercial large language models, given their growing popularity.
UBC-DLNLP at SemEval-2023 Task 12: Impact of Transfer Learning on African Sentiment Analysis
Bhatia, Gagan, Adebara, Ife, Elmadany, AbdelRahim, Abdul-Mageed, Muhammad
We describe our contribution to the SemEVAl 2023 AfriSenti-SemEval shared task, where we tackle the task of sentiment analysis in 14 different African languages. We develop both monolingual and multilingual models under a full supervised setting (subtasks A and B). We also develop models for the zero-shot setting (subtask C). Our approach involves experimenting with transfer learning using six language models, including further pertaining of some of these models as well as a final finetuning stage. Our best performing models achieve an F1-score of 70.36 on development data and an F1-score of 66.13 on test data. Unsurprisingly, our results demonstrate the effectiveness of transfer learning and fine-tuning techniques for sentiment analysis across multiple languages. Our approach can be applied to other sentiment analysis tasks in different languages and domains.
AfroLID: A Neural Language Identification Tool for African Languages
Adebara, Ife, Elmadany, AbdelRahim, Abdul-Mageed, Muhammad, Inciarte, Alcides Alcoba
Language identification (LID) is a crucial precursor for NLP, especially for mining web data. Problematically, most of the world's 7000+ languages today are not covered by LID technologies. We address this pressing issue for Africa by introducing AfroLID, a neural LID toolkit for $517$ African languages and varieties. AfroLID exploits a multi-domain web dataset manually curated from across 14 language families utilizing five orthographic systems. When evaluated on our blind Test set, AfroLID achieves 95.89 F_1-score. We also compare AfroLID to five existing LID tools that each cover a small number of African languages, finding it to outperform them on most languages. We further show the utility of AfroLID in the wild by testing it on the acutely under-served Twitter domain. Finally, we offer a number of controlled case studies and perform a linguistically-motivated error analysis that allow us to both showcase AfroLID's powerful capabilities and limitations.
MasakhaNER: Named Entity Recognition for African Languages
Adelani, David Ifeoluwa, Abbott, Jade, Neubig, Graham, D'souza, Daniel, Kreutzer, Julia, Lignos, Constantine, Palen-Michel, Chester, Buzaaba, Happy, Rijhwani, Shruti, Ruder, Sebastian, Mayhew, Stephen, Azime, Israel Abebe, Muhammad, Shamsuddeen, Emezue, Chris Chinenye, Nakatumba-Nabende, Joyce, Ogayo, Perez, Aremu, Anuoluwapo, Gitau, Catherine, Mbaye, Derguene, Alabi, Jesujoba, Yimam, Seid Muhie, Gwadabe, Tajuddeen, Ezeani, Ignatius, Niyongabo, Rubungo Andre, Mukiibi, Jonathan, Otiende, Verrah, Orife, Iroro, David, Davis, Ngom, Samba, Adewumi, Tosin, Rayson, Paul, Adeyemi, Mofetoluwa, Muriuki, Gerald, Anebi, Emmanuel, Chukwuneke, Chiamaka, Odu, Nkiruka, Wairagala, Eric Peter, Oyerinde, Samuel, Siro, Clemencia, Bateesa, Tobius Saul, Oloyede, Temilola, Wambui, Yvonne, Akinode, Victor, Nabagereka, Deborah, Katusiime, Maurice, Awokoya, Ayodele, MBOUP, Mouhamadane, Gebreyohannes, Dibora, Tilaye, Henok, Nwaike, Kelechi, Wolde, Degaga, Faye, Abdoulaye, Sibanda, Blessing, Ahia, Orevaoghene, Dossou, Bonaventure F. P., Ogueji, Kelechi, DIOP, Thierno Ibrahima, Diallo, Abdoulaye, Akinfaderin, Adewale, Marengereke, Tendai, Osei, Salomey
We take a step towards addressing the under-representation of the African continent in NLP research by creating the first large publicly available high-quality dataset for named entity recognition (NER) in ten African languages, bringing together a variety of stakeholders. We detail characteristics of the languages to help researchers understand the challenges that these languages pose for NER. We analyze our datasets and conduct an extensive empirical evaluation of state-of-the-art methods across both supervised and transfer learning settings. We release the data, code, and models in order to inspire future research on African NLP.
Towards End-to-End Training of Automatic Speech Recognition for Nigerian Pidgin
Ajisafe, Daniel, Adegboro, Oluwabukola, Oduntan, Esther, Arulogun, Tayo
Nigerian Pidgin remains one of the most popular languages in West Africa. With at least 75 million speakers along the West African coast, the language has spread to diasporic communities through Nigerian immigrants in England, Canada, and America, amongst others. In contrast, the language remains an under-resourced one in the field of natural language processing, particularly on speech recognition and translation tasks. In this work, we present the first parallel (speech-to-text) data on Nigerian pidgin. We also trained the first end-to-end speech recognition system (QuartzNet and Jasper model) on this language which were both optimized using Connectionist Temporal Classification (CTC) loss. With baseline results, we were able to achieve a low word error rate (WER) of 0.77% using a greedy decoder on our dataset. Finally, we open-source the data and code along with this publication in order to encourage future research in this direction.